R programming
Post to Twitter R-administration All stuff about R administration can be found here Tools for editing R scripts Some good tools for using R R commander Pretty good tools to start using R and uses its basic methods. Tinn-R editor R studio Very powerful IDE based on the R language. Rattle Google style Style to code R statet Eclipse plugin for R Import/Export from/to R A summary about how to import/export files from/to R can be found here . In brief, you can *import from text files (csv, tsv files for instance) with dat <- read.csv(file,sep=',',check.names=F) *import from R data file (.Rdata) dat <- load(file) *import from xls file: "Two packages which can read and and manipulate Excel 2007/10 spreadsheets but not earlier formats are xlsx (which requires Java) and the Omegahat package RExcelXML. Package XLConnect can read, write and manipulate both Excel 97–2003 and Excel 2007/10 spreadsheets, requiring Java. Perl users have contributed a module OLE::SpreadSheet::ParseExcel and a program xls2csv.pl to convert Excel 95–2003 spreadsheets to CSV files. Package gdata provides a basic wrapper in its read.xls function. With suitable Perl modules installed this function can also read Excel 2007 spreadsheets." (see http://cran.r-project.org/doc/manuals/R-data.pdf) *import/export from statistical languages like **SPSS: write.foreign, read.spss **Stata: read.dta and write.dta **SAS: read.xport **S-PLUS: read.S **Octave: read.octave * import from databases: RMySQL, ROracle, RPostgreSQL, RSQLite, RpgSQL,RdbiPgSQL, PL/R is a project to embed R into PostgreSQL An example with MySQL library(RMySQL) con <- dbConnect(dbDriver("MySQL"), dbname = "test") dbListTables(con) data(USArrests) dbListTables(con) dbReadTable(con, "arrests") dbGetQuery(con, paste("select row_names, Murder from arrests", "where Rape > 30 order by Murder")) dbRemoveTable(con, "arrests") dbDisconnect(con) R on a webserver Rapache Rserve R to/from other languages Rpy, RPy2 for python Rjava for java Rmatlab for matlab 'Tips about R' *Do not use 'for' loops in R. They will take ages if you make too many iterations. Instead use the '*apply' methods (sapply, lapply, tapply, apply). lapply itterates along a list, sapply along a vector; apply and tapply have extra parameters to indicate which part of the array or vector to use. For apply it is the dimension of the array over which to iterate. Ex: apply(mymat,2,sum,na.rm=TRUE) will produce the sum of each column of the matrix, after missing values have been removed. tapply selects parts of a vector using factors tapply(myvector,myfactor,mean) will calculate the mean of values of myvector after splitting them into groups based on the different values of myfactor. As illustration, here is some code using sapply and apply testFPKM<-apply(mydat2:10,2:13,1,function(x{y1<-xc(1:4,11,12);y2<-xc(5:10);t.test(y1,y2)}) rawp0<-sapply(mydat,function(x) print(x$p.value)) *If you plot a very''' large number of points', it may be interesting to use the smoothing function scatter.smooth. This will speed up the plotting process and give you a density plot which may be very informative. *ifelse is very useful when dealing with vector. Ex: z<- ifelse(x,y/x,y/2) This gives zi <- yi/xi or zi <- yi/2 *To get several subplots (1 line of 2 subplots in this case), use par(mfrow=c(1,2)) If you really want to customize your plots, you can use par( fig = c(0, 0.45, 0, 0.80), mar = c(0, 0, 0, 0), cex = 0.8 ) plot(myfig1) # FIGURE 2 par( fig = c(0.46, 0.98, 0.03, 0.776), mar = c(0,0,0,0), new = T ) hist(myfig2) par( fig = c(0.44, 1, 0.81, 1), mar = c(0,0,0,0), cex = 0.8, new = T ) plot.new() text(mytext3) *It is well known that maximum likelihood estimate in general tend to underestimate variance parameters because they fail to adjust for the fact that the mean is estimated from the same data. This will not have much impact if the number of samples is large. * If you're looking up elements by index (and not value), you should stick with vectors: vectors are slightly faster than environments. However, if your program contains a large number of lookups by label, you should consider replacing vectors with environments. More information can be found on Joseph Adler's blog. An example of labeled arrays: labeled.array <- function(n) { a <- 1:n from <- "0123456789" to <- "ABCDEFGHIJ" for (i in 1:n) { names(a)i <- chartr(from,to,i) } a } a.20 <- labeled.array(20) a.20 An example of labeled environment: labeled.environment <- function(n) { e <- new.env(hash=TRUE,size=n) from <- "0123456789" to <- "ABCDEFGHIJ" for (i in 1:n) { assign(x=chartr(from,to,i), value=i, envir=e) } e } e.20<-labeled.environment(20) get("B",envir=e.20) Making block formula. Study the effect of treatment in all studies mymodel <- formula(classification ~ treatment | study) Writing new R packages A complete document is available here . R in action Data Manipulation http://www.stats.ox.ac.uk/~ruth/RCourse/Rcourse3.pdf Stuff to read table, play with matrices or vectors, get to some elements or parts of the matrix, create function, merge data frames. For matrix multiplication, use %*%. Inverses of square matrices can be found using the '''solve' command. eigen computes eigenvalues and eigenvectors of matrices. colSums, colMeans returns the column sums or means of a matrix. rowSums and rowMeans the sums or means of the rows. Can also be used on arrays http://cran.r-project.org/doc/contrib/usingR.pdf is a great tutorial to get started with the basics. Among other things, you will see what is the previous reference + how to plot, how to deal with NA elements in data frames, how to make simple plots, add lines to them, scatterplots (and its smooth version), histograms, boxplots, rugplots (see example below) . Other info is also listed in 'Data visualization ' Interesting stuff for visualisation can be found on the website: http://www.wekaleamstudios.co.uk/supplementary-material/ 'Linear models' http://cran.r-project.org/doc/contrib/usingR.pdf Simple linear regression. You will find out what is behind regression objects in R, how to make model formula in R, multiple linear regression, polynomial and spline regression. Simple linear models reslm<-lm(formula = benefit ~ hard + tens, data = Rubber) coef(reslm) # estimate Beta vcov(reslm) # covariance matrix of ^Beta resid<-residuals(reslm) # residuals fittedval<-fitted(reslm) # fitted values plot(fittedval,resid) # plot res. vs fit. values qqplot(resid) # residuals should follow a normal distribution (the line here) To plot the results of the regression, use termplot(Rubber.lm, partial=TRUE, smooth=panel.smooth) You'll get something like this If you want to add non-negative and/or non-positive constraints, you should use nnls. '' A <- as.matrix(data,c("var1","var2","var3")) b.res <- data$depVar nnls.res <- nnls(A,b.res) We can't use predict to generate our results as with ''lm and have to manually perform the matrix multiplication. This can be done as shown below. coef.res <- coef(nnls.res) pred.res.nnls <- as.vector(coef.res%*%t(A)) What order of polynomial? > seedrates.lm1<-lm(grain~rate,data=seedrates) > seedrates.lm2<-lm(grain~rate+I(rate^2),data=seedrates) > anova(seedrates.lm2,seedrates.lm1) Model 1: grain ~ rate + I(rate^2) Model 2: grain ~ rate Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F) 1 2 0.026286 2 3 0.187000 -1 -0.160714 12.228 0.07294 The F-value is large, but on this evidence there are too few degrees of freedom to make a totally convincing case for preferring a quadratic to a line. Maximum Likelihood (ML) estimation can be undertaken with the standard optim() function. Many other suitable methods are listed in the Optimization view. Check whether one data point has a large influence on your regression results with Cook's distance. The latter also tells you whether some regions are not very represented by your data points. Interesting R functions are cooks.distance() ''and ''influence.measures(). To get the best set of variables (independent variables) that explain another variable (dependent variable), you can use the function leaps() from the package leaps. Non linear models Non-linear least squares can be estimated with the nls() function, as well as with nlme() from the nlme package. Latent variable models A latent variable model is a statistical model that relates a set of variables (so-called manifest variables) to a set of latent variables (def. from Wikipedia). The R package "ltm" or "lava" can be used for this kind of analysis. Machine learning models Support vector machines R packages: *R-SVM is a SVM-based method for doing supervised pattern recognition(classification) with microarray gene expression data. *e1071: Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier *libsvm: allows Multi-class classification, using the one-against-one technique by fitting all binary subclassifiers and finding the correct class by a voting mechanism; Neural networks Regression and classification trees Recursive partitioning Def. by Wikipedia: Recursive partitioning is a statistical method for multivariable analysis.[1] Recursive partitioning creates a decision tree that strives to correctly classify members of the population based on several dichotomous dependent variables. rpart package can be used for regression as well as for classification with the following methods: *categorical: method "class" *continuous: method "anova" *Poisson Process/Count: method "poisson" *Survival: method "exp" A typical use of rpart is: library(rpart) fit <- rpart(mycount ~ age + sex , method='Poisson') print(fit) # text version of the tree summary(fit) # summary of each node in depth plot(fit) # plot the tree text(fit) # plot labels under leafs post(fit,file='') # plot a prettier version of the tree A good tutorial about how to use ''rpart is here. Pruning the tree may be interesting if the tree is too complex without improving significantly the error. fit_prune <- prune(fit, cp = cp) plot(fit_prune, uniform = TRUE, margin = 0.1,branch = 0.5, compress = TRUE) text(fit_prune) To make a forest (with bagging procedure): trees <- vector(mode = "list", length = 25) n <- nrow(mytdata) bootsamples <- rmultinom(length(trees), n, rep(1, n)/n) mod <- rpart(Class ~ ., data = mydata, control = rpart.control(xval = 0)) for (i in 1:length(trees)) treesi <- update(mod, weights = bootsamples,i) classprob <- matrix(0, nrow = n, ncol = length(trees)) for (i in 1:length(trees)) { classprob,i <- predict(treesi,newdata = mydata),1 } avg <- rowMeans(classprob, na.rm = TRUE) predictions <- factor(ifelse(avg > 0.5, "abnormal", "normal")) predtab <- table(predictions, mydata$Class) Bagging is a special case of the more general method Random Forest. Comparing populations Normal distribution Be careful to test for normality beforehand using the qqplot or the Shapiro test. Here is an example for a paired t-test where we test the normality of the difference (diff). boxplot(x$diff) qqnorm(x$diff) qqline(x$diff) shapiro.test(x$diff) Classical cases: few samples, few variables You're a lucky guy ... your distro follows a normal distribution. If populations are normally distributed and the variances are equal we can use the t-test with the test statistic: t = \frac{\overline{y}_1 - \overline{y}_2 }{s \sqrt{1/n_1+1/n_2}} \tilde t_{n_1+n_2-2}, where s'' is the Standard deviation of the sample and ''n is the sample size. The degrees of freedom used in this test is n'' − 1. If the variances are not equal, it is preferable to use the test statistic (Welch test): t = \frac{\overline{y}_1 - \overline{y}_2 }{\sqrt{s_1^2/n_1+s_2^2/n_2}} \tilde t_{\nu}, If the data are paired (still normally distributed), we have the paired test statistic: \frac{\overline{d}}{s \sqrt{1/n}} \tilde t_{n-1} This website explains how to use the t-test. If you have more than 2 groups, you will be using ANOVA aov(depVar ~ indepVar1 * indepVar2, data = mydat) If there are more than 1 dependent variable, use MANOVA res <- manova(cbind(depVar1, depVar2,depVar3) ~ indepVar1 * indepVar2,data =mydata) sapply(c("Pillai", "Wilks", "Hotelling-Lawley", "Roy"),function(test)summary(res, test = test)$stats1,6) '''Wilcoxon Mann-Whitney rank sum test' applies the t-statistic to the joint ranks of all measurements in both groups instead of the original measurements. The Wilcoxon signed-rank statistic is based on the ranks of the absolute differences |di|. The statistic is defined as the sum of the ranks associated with positive difference di > 0. It should be noted that this test is only valid when the differences di are symmetrically distributed. An interesting tool for multiple testing allowing to use either of these methods is "multtest". After testing with mt.teststat(yourData,classlabel=yourClassLabels) You can plot the test statistics using qqplot to see whether your distribution follow a normal or to see variables with "unusual" test statistics. "multtest" also provides tools for multiple testing: The procedures include the Bonferroni, Holm (1979), Hochberg (1988), and Sidak procedures for strong control of the family–wise Type I error rate (FWER) (false negative), and the Benjamini and Hochberg (1995) and Benjamini and Yekutieli (2001) procedures for (strong) control of the false discovery rate (FDR, false positive). Adjusted p–values for these seven multiple testing procedures can be computed as follows and stored in the original variable order in adjp using order(res$index): procs <- c("Bonferroni", "Holm", "Hochberg", "SidakSS", "SidakSD", "BH", "BY") res <- mt.rawp2adjp(rawp0, procs) adjp <- res$adjporder(res$index), where rawp0 are your p-values considering the null hypothesis. If you don't like these procedures and have some time to spend calculating permutations you can use mt.maxT and mt.minP. The mt.maxT and mt.minP functions compute permutation adjusted p–values for the maxT and minP step–down multiple testing procedure described in Westfall and Young (1993). There are thus in general less conservative than the standard Bonferroni procedure. This means that it allows more easily positive event reducing the false negative rate but potentially increasing the false positive rate. The permutation algorithm for the maxT and minP procedures is described in Ge et al. (2003). resT <- mt.maxT(yourData) More complex case : few samples, many variables If the distribution follows approximately a normal distribution and you're dealing with discrete and skewed data (for example when working with count data like in differential gene expression), you can use a test based on a negative binomial distribution. The advantage of this distribution over the Poisson distribution is that it allows to model biological variance correctly. In clear, if you plot the log2fold against the mean of your variable, you may see that the spead of the points is not equal from one variable to another (see plot below taken from the vignette of DESeq). In this case the Poisson distribution will not be able to model the variation in the variances between the variables. DESeq is a great tool to model almost any type of variance. EdgeR is also a great tool but will only model linear regression in the variance. In short, use DESeq or EdgeR when you deal with many variables, only a few samples and you expect the variance or dispersion to vary along the variables (ex: variance for gene 1 is different from variance for gene 2, etc). These packages are based on the so-called quantile-adjusted conditional maximum likelihood (qCML) which gives smaller bias than the maximum likelihood (ML). If the number of samples increases a conditional maximum likelihood (CML) estimate will be as good as the qCML. Not normal distribution Fisher-exact test can be used if you work with contingency tables where sample sizes are small. Use fisher.test in R. http://cs.brown.edu/system/software/latex/doc/xtab.pdfxstab may help building up contingency table from a data frame. With large samples, a chi-square test can be used in this situation. However, the significance value it provides is only an approximation, because the sampling distribution of the test statistic that is calculated is only approximately equal to the theoretical chi-squared distribution. Multiple test correction An interesting R package is {multtest}. Suppose genes are our variables *'Bonferroni' Corrected P-value= p-value * n (number of genes in test) <0.05 The Bonferroni correction is the most stringent test of all, but offers the most conservative approach to control for false positives. *'Bonferroni Step-down (Holm) correction' #The p-value of each gene is ranked from the smallest to the largest. #The first p-value is multiplied by the number of genes present in the gene list: if the end value is less than 0.05, the gene is significant: Corrected P-value= p-value * n < 0.05 #The second p-value is multiplied by the number of genes - 1: Corrected P-value= p-value * n-1 < 0.05 #The third p-value is multiplied by the number of genes less 2: Corrected P-value= p-value * n-2 < 0.05 It follows that sequence until no gene is found to be significant. *'Westfall and Young Permutation' Both Bonferroni and Holm methods are called single-step procedures, where each p- value is corrected independently. The Westfall and Young permutation method takes advantage of the dependence structure between genes, by permuting all the genes at the same time (reassignments of samples to groups). The Westfall and Young Permutation is a correction accounting for genes coregulation. However, it is very slow and is also very conservative. The Westfall and Young permutation follows a step-down procedure similar to the Holm method, combined with a bootstrapping method to compute the p-value distribution: #P-values are calculated for each gene based on the original data set and ranked. #The permutation method creates a pseudo-data set by dividing the data into artificial treatment and control groups. #P-values for all genes are computed on the pseudo-data set. #The successive minima of the new p-values are retained and compared to the original ones. #This process is repeated a large number of times, and the proportion of resampled data sets where the minimum pseudo-p-value is less than the original p-value is the adjusted p-value. Because of the permutations, the method is very slow. The Westfall and Young permutation method has a similar Family-wise error rate as the Bonferroni and Holm corrections. *'Benjamini and Hochberg False Discovery Rate' This correction is the least stringent than Bonferroni or Bonferroni Step-down (Holm) '''or '''Westfall and Young Permutation, and therefore tolerates more false positives. There will be also less false negative genes. It provides a good balance between discovery of statistically significant genes and limitation of false positive occurrences. Benjamini and Hochberg False Discovery Rate 'does not take into account gene corregulation. Here is how it works: #The p-values of each gene are ranked from the smallest to the largest. #The largest p-value remains as it is. #The second largest p-value is multiplied by the total number of genes in gene list divided by its rank. If less than 0.05, it is significant. Corrected p-value = p-value*(n/n-1) < 0.05, if so, gene is significant. #The third p-value is multiplied as in step 3: Corrected p-value = p-value*(n/n-2) < 0.05, if so, gene is significant. And so on. *'Benjamini–Hochberg–Yekutieli This correction does take into account that the tests are dependent. It is more conservative than the Benjamini-Hochberg multiple test correction. #The p-values of each gene are ranked from the smallest to the largest. #The largest p-value remains as it is. #The second largest p-value is multiplied by the total number of genes in gene list divided by its rank but corrected. If less than 0.05, it is significant. Corrected p-value = p-value*(n/((n-1)* \sum_{i=1}^{n-1}1/i ) < 0.05, if so, gene is significant. #The third is multiplied by the total number of genes in gene list divided by its rank but corrected. If less than 0.05, it is significant. Corrected p-value = p-value*(n/((n-2)* \sum_{i=1}^{n-2}1/i ) < 0.05, if so, gene is significant. And so on. Post-Hoc testing of ANOVA Multiple comparison procedures are commonly used in an analysis of variance after obtaining a significant omnibus test result, like the ANOVA F-test. The significant ANOVA result suggests rejecting the global null hypothesis H0 that the means are the same across the groups being compared. Multiple comparison procedures are then used to determine which means differ. In a one-way ANOVA involving K'' group means, there are ''K(K'' −1)/2 pairwise comparisons. *'Tukey's test ''' If your data are independent and homoscedastic (equal variances), Tukey's test will recalculate the confidence limits. So in this case, you do not correct the alpha (p-value threshold) afterwards but calculate the p-value given the new confidence limits. TukeyHSD(aov(depVar ~ indepVar1 * indepVar2,data = mydata), "indepVar1") *Scheffé's test Can be used for any contrasts, not only pairwise comparisons as for Tukey's test. It is very powerful for ANOVA. If only pairwise comparisons are to be made, the Tukey–Kramer method will result in a narrower confidence limit, which is preferable. In the general case when many or all contrasts might be of interest, the Scheffé method tends to give narrower confidence limits and is therefore the preferred method. R in real-life applications A list of applications using R can be found on the below listed websites: *http://blog.revolutionanalytics.com/2011/07/gigaom-article-on-r-big-data-and-data-science.html Drew Conway's analysis of the Afghanistan attacks data released by Wikileaks. Analyzing attack data may help localize enemy troops. *http://www.inside-r.org/packages/cran/stratasphere A statistical technique at Google used to evaluate the factors that lead to user satisfaction of satisfaction with search reports, or when users are asked to rate YouTube videos. *http://www.google.org/flutrends/ One interesting application is the Google Flu Trends project, which uses R to estimate current flu activity based on Google search results. Google Trends aggregates user search queries showing how often a particular word or phrase has been searched. Correlation tests are run on the search results to obtain a manageable data set of potentially relevant variables. Then using R, they massage the data and create models with optimized weights for each search term. From this, they are able to reasonably estimate current flu activity for different regions around the world. *http://www.r-bloggers.com/opendata-r-google-easy-maps/ Easy map using google tools. See http://code.google.com/apis/chart/interactive/docs/gallery.html for some examples. For including googleVis output into a blogger post, have a lookhere. *http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/ Text data mining with twitter